{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Inspect feature importance\n",
    "\n",
    "A good way to diagnose a model's performance is to examine which features it uses the most to make predictions, and which features don't seem to help at all. This is called feature importance analysis.\n",
    "\n",
    "To do so, we first load the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from pathlib import Path\n",
    "from sklearn.externals import joblib\n",
    "\n",
    "import sys\n",
    "sys.path.append(\"..\")\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "from ml_editor.data_processing import (\n",
    "    format_raw_df,\n",
    "    get_split_by_author,  \n",
    "    add_text_features_to_df,\n",
    "    get_vectorized_series, \n",
    "    get_feature_vector_and_label\n",
    ")\n",
    "from ml_editor.model_evaluation import get_feature_importance\n",
    "\n",
    "data_path = Path('../data/writers.csv')\n",
    "df = pd.read_csv(data_path)\n",
    "df = format_raw_df(df.copy())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we add features and split the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = add_text_features_to_df(df.loc[df[\"is_question\"]].copy())\n",
    "train_df, test_df = get_split_by_author(df, test_size=0.2, random_state=40)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We load the pretrained model and vectorizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_path = Path(\"../models/model_1.pkl\")\n",
    "clf = joblib.load(model_path) \n",
    "vectorizer_path = Path(\"../models/vectorizer_1.pkl\")\n",
    "vectorizer = joblib.load(vectorizer_path) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We vectorize the text and get features ready for our model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df[\"vectors\"] = get_vectorized_series(train_df[\"full_text\"].copy(), vectorizer)\n",
    "test_df[\"vectors\"] = get_vectorized_series(test_df[\"full_text\"].copy(), vectorizer)\n",
    "\n",
    "features = [\n",
    "                \"action_verb_full\",\n",
    "                \"question_mark_full\",\n",
    "                \"text_len\",\n",
    "                \"language_question\",\n",
    "            ]\n",
    "X_train, y_train = get_feature_vector_and_label(train_df, features)\n",
    "X_test, y_test = get_feature_vector_and_label(test_df, features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can leverage sklearn's api to display the most and least important features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "w_indices = vectorizer.get_feature_names()\n",
    "w_indices.extend(features)\n",
    "all_feature_names = np.array(w_indices)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Top 10 importances:\n",
      "\n",
      "text_len: 0.0091\n",
      "are: 0.006\n",
      "what: 0.0051\n",
      "writing: 0.0048\n",
      "can: 0.0043\n",
      "ve: 0.0041\n",
      "on: 0.0039\n",
      "not: 0.0039\n",
      "story: 0.0039\n",
      "as: 0.0038\n",
      "\n",
      "Bottom 10 importances:\n",
      "\n",
      "horrifying: 0\n",
      "succeeds: 0\n",
      "fired: 0\n",
      "settling: 0\n",
      "settled: 0\n",
      "horrific: 0\n",
      "cms: 0\n",
      "pulling: 0\n",
      "coast: 0\n",
      "moonlight: 0\n"
     ]
    }
   ],
   "source": [
    "k = 10\n",
    "print(\"Top %s importances:\\n\" % k)\n",
    "print('\\n'.join([\"%s: %.2g\" % (tup[0], tup[1]) for tup in get_feature_importance(clf, all_feature_names)[:k]]))\n",
    "\n",
    "print(\"\\nBottom %s importances:\\n\" % k)\n",
    "print('\\n'.join([\"%s: %.2g\" % (tup[0], tup[1]) for tup in get_feature_importance(clf, all_feature_names)[-k:]]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the text length is the most important feature. The next most important features seem like common English words. In order to use this model to provide useful writing suggestions, we should work on generating features values that would be **easier** to turn into actionable writing advice."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ml_editor",
   "language": "python",
   "name": "ml_editor"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
